# Multimodal video understanding

Qwen2.5 VL 32B Instruct GGUF

Qwen2.5-VL-32B-Instruct is a powerful vision-language model with enhanced mathematical and problem-solving abilities, suitable for multimodal tasks.

Tags: Image-to-Text, English

Xclip Large Patch14 Kinetics 600

X-CLIP is an extended version of CLIP for general video-language understanding, trained on video-text pairs through contrastive learning.

Tags: Transformers, English

© 2025 AIbase